This document describes the development of a vision-based driving system for SuperTuxKart, a popular open-source racing game. First, a low-level controller is designed to serve as an auto-pilot for driving in the game. This controller makes the kart accelerate, steer, and brake according to a set of predefined rules. Once the auto-pilot works, it is used to train a vision-based driving system. Combining the designed controller with computer vision techniques yields a driving system that can navigate the game’s tracks.
I have provided a brief review of topics covered in this report:
Convolutional Neural Networks (CNNs) are widely used in computer vision tasks, particularly in image and video recognition. CNNs apply a series of convolutional filters to the input image, which allows the network to detect various features and patterns. These filters are then passed through a non-linear activation function and downsampled using pooling layers. Finally, the resulting features are fed into fully connected layers for classification or regression tasks.
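The convolution, activation, and pooling steps described above can be sketched in a few lines of plain Python. This is only an illustration on a tiny hand-made image; the helper names (`conv2d_valid`, `relu`, `max_pool_2x2`) and the edge-detecting kernel are assumptions made for the example, not part of the project code.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Non-linear activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Downsample by keeping the largest value in each 2x2 window."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


# 5x5 image: dark on the left (0), bright on the right (1), so one vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical filter that responds to dark-to-bright transitions.
kernel = [[-1, 1], [-1, 1]]

features = relu(conv2d_valid(image, kernel))  # 4x4 feature map
pooled = max_pool_2x2(features)               # 2x2 after pooling
print(features)
print(pooled)
```

The filter fires only in the column where the edge sits, and pooling keeps that response while halving the resolution. In a trained CNN the kernel values are learned rather than written by hand.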
Fully Convolutional Networks (FCNs) are a type of neural network that were developed for semantic segmentation tasks, where the goal is to classify each pixel in an image. Unlike traditional neural networks, FCNs replace the fully connected layers with convolutional layers, enabling the network to accept inputs of arbitrary size. The output of an FCN is a pixel-wise classification map, which can be upsampled to the original size of the input image. FCNs have achieved state-of-the-art performance in many semantic segmentation tasks, including those that require pixel-level classification.
It’s important to note that FCNs are actually a type of CNN. The main difference is that FCNs use convolutional layers exclusively, whereas traditional CNNs typically include fully connected layers for classification or regression tasks. By replacing the fully connected layers with convolutional layers, FCNs can perform pixel-wise classification, making them ideal for semantic segmentation tasks. In contrast, traditional CNNs are often used for image classification or object recognition tasks, where the goal is to classify the entire image or a region of interest within the image.
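To illustrate why replacing fully connected layers with convolutions removes the fixed-input-size restriction, here is a minimal sketch of a 1×1 convolution written in plain Python. It applies the same small linear classifier at every pixel, so it works for any height and width. The function name `conv1x1` and the weights are hypothetical and chosen only for the example.

```python
def conv1x1(feature_map, weights, bias):
    """Apply one linear classifier independently at every pixel.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in
    bias:        C_out
    returns:     H x W x C_out
    """
    return [
        [
            [sum(w * v for w, v in zip(w_row, pixel)) + b
             for w_row, b in zip(weights, bias)]
            for pixel in row
        ]
        for row in feature_map
    ]


# Hypothetical classifier: 2 input channels -> 2 class scores.
weights = [[1.0, -1.0], [0.5, 0.5]]
bias = [0.0, 0.1]

small = [[[1.0, 0.0] for _ in range(3)] for _ in range(2)]  # 2 x 3 pixels
large = [[[0.0, 1.0] for _ in range(7)] for _ in range(5)]  # 5 x 7 pixels

out_small = conv1x1(small, weights, bias)
out_large = conv1x1(large, weights, bias)
print(len(out_small), len(out_small[0]))  # output keeps the 2 x 3 layout
print(len(out_large), len(out_large[0]))  # output keeps the 5 x 7 layout
```

A fully connected layer would instead need a weight matrix sized for one specific flattened input, which is why it cannot be reused across image sizes.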
Computer vision is a subset of artificial intelligence that enables computers to interpret and analyze images and video data. It involves the development of algorithms and techniques that can extract meaningful information from visual data, such as recognizing objects, detecting patterns, and identifying faces. Computer vision has a wide range of applications, including robotics, self-driving cars, medical imaging, and security systems. With the recent advancements in deep learning, computer vision has become increasingly sophisticated and accurate, enabling machines to perform tasks that were once thought impossible.
The low-level controller takes two inputs: an aim point and the kart’s current velocity. The aim point is a point on the center of the track, 15 meters away from the kart. The controller code handles steering, acceleration, braking, and drifting so that the kart completes each course within a specified time.
The table below shows the time limit within which each course had to be completed:
Course Time Constraints
| zengarden | lighthouse | hacienda | snowtuxpeak | cornfield_crossing | scotland |
|---|---|---|---|---|---|
| 50s | 50s | 60s | 60s | 70s | 70s |
Input
aim_point: The aim point that the kart is heading toward
current_vel: The current velocity of the kart
Output
action: The next step that is taken for the kart to follow the given aim-point at the given current velocity
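
    import numpy as np
    import pystk

    def control(aim_point, current_vel):
        action = pystk.Action()
        target_vel = 27.5
        # Throttle proportional to the gap between target and current speed
        action.acceleration = np.clip((target_vel - current_vel) / 10, -1, 1)
        # Brake only when the aim point is far from the image center and the kart is well over the target speed
        action.brake = np.linalg.norm(aim_point) > 1.12 and current_vel > target_vel * 1.1
        # Angle to the aim point, scaled by a sigmoid of its distance, then clipped to [-1, 1]
        action.steer = max(-1, min(1, (np.arctan2(aim_point[0], -aim_point[1]) * 3.8 / (1 + np.e**(-0.8 * (np.linalg.norm(aim_point) - 0.5)))) / np.pi))
        # Drift on sharp turns
        action.drift = abs(action.steer) > 0.511
        return action
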
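To make the controller rules easier to inspect without the game installed, the sketch below repeats the same arithmetic using only the standard library and returns a plain dictionary instead of a `pystk.Action`. The function name `control_values` and the sample aim points are assumptions made for illustration; the constants are copied from the controller above.

```python
import math


def control_values(aim_point, current_vel, target_vel=27.5):
    """Compute the controller outputs as plain numbers (no pystk needed)."""
    ax, ay = aim_point
    dist = math.hypot(ax, ay)                  # distance of aim point from image center
    angle = math.atan2(ax, -ay)                # signed angle toward the aim point
    gain = 3.8 / (1.0 + math.exp(-0.8 * (dist - 0.5)))  # sigmoid gain grows with distance
    steer = max(-1.0, min(1.0, angle * gain / math.pi))
    acceleration = max(-1.0, min(1.0, (target_vel - current_vel) / 10.0))
    brake = dist > 1.12 and current_vel > target_vel * 1.1
    drift = abs(steer) > 0.511
    return {"steer": steer, "acceleration": acceleration,
            "brake": brake, "drift": drift}


# Aim point straight ahead: no steering, moderate throttle.
straight = control_values((0.0, -0.5), current_vel=20.0)
# Aim point far to the left while over speed: full left lock, brake and drift.
hard_left = control_values((-1.0, 0.6), current_vel=35.0)
print(straight)
print(hard_left)
```

Because the gain is a sigmoid of the aim point's distance from the image center, distant aim points produce stronger steering than nearby ones at the same angle.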
Below is a snapshot of the auto-pilot working on the snowtuxpeak course:
In this image, the red circle marks the aim point that the auto-pilot is following, and the blue circle marks the center of the kart.
The next image, also from the snowtuxpeak course, shows the kart drifting to keep up with the aim point:
The auto-pilot built from the controller function above is then used to generate a training dataset for the FCN model behind our vision-based driving system.
Below is a snapshot of some of the images that were generated as training images from the auto-pilot controller:
The drive dataset is then used to train the FCN shown below, the planner model.
This planner is responsible for taking an image as input and then outputting the corresponding aim point in the image coordinate. Once the aim point is predicted, the designed controller will map those points to appropriate actions.

    import torch
    import torch.nn.functional as F

    class Planner(torch.nn.Module):
        class Block(torch.nn.Module):
            # Residual down-sampling block: three conv + batch-norm layers plus a 1x1 strided skip path
            def __init__(self, n_input, n_output, kernel_size=3, stride=2):
                super().__init__()
                self.c1 = torch.nn.Conv2d(n_input, n_output, kernel_size=kernel_size, padding=kernel_size // 2,
                                          stride=stride)
                self.c2 = torch.nn.Conv2d(n_output, n_output, kernel_size=kernel_size, padding=kernel_size // 2)
                self.c3 = torch.nn.Conv2d(n_output, n_output, kernel_size=kernel_size, padding=kernel_size // 2)
                self.b1 = torch.nn.BatchNorm2d(n_output)
                self.b2 = torch.nn.BatchNorm2d(n_output)
                self.b3 = torch.nn.BatchNorm2d(n_output)
                self.skip = torch.nn.Conv2d(n_input, n_output, kernel_size=1, stride=stride)

            def forward(self, x):
                return F.relu(self.b3(self.c3(F.relu(self.b2(self.c2(F.relu(self.b1(self.c1(x)))))))) + self.skip(x))

        class UpBlock(torch.nn.Module):
            # Up-sampling block: transposed convolution followed by ReLU
            def __init__(self, n_input, n_output, kernel_size=3, stride=2):
                super().__init__()
                self.c1 = torch.nn.ConvTranspose2d(n_input, n_output, kernel_size=kernel_size, padding=kernel_size // 2,
                                                   stride=stride, output_padding=1)

            def forward(self, x):
                return F.relu(self.c1(x))

        def __init__(self, layers=[16,32,64,128], n_class=2, kernel_size=3, use_skip=True):
            super().__init__()
            self.input_mean = torch.Tensor([0.2788, 0.2657, 0.2629])
            self.input_std = torch.Tensor([0.2064, 0.1944, 0.2252])

            c = 3
            self.use_skip = use_skip
            self.n_conv = len(layers)
            skip_layer_size = [3] + layers[:-1]
            # Encoder: one down-sampling Block per entry in layers
            for i, l in enumerate(layers):
                self.add_module('conv%d' % i, self.Block(c, l, kernel_size, 2))
                c = l
            # Decoder: one UpBlock per entry, built in reverse order
            for i, l in list(enumerate(layers))[::-1]:
                self.add_module('upconv%d' % i, self.UpBlock(c, l, kernel_size, 2))
                c = l
                if self.use_skip:
                    c += skip_layer_size[i]
            self.classifier = torch.nn.Conv2d(c, n_class, 1)
            self.size = torch.nn.Conv2d(c, 2, 1)

        def forward(self, x):
            # Normalize the image with the per-channel mean and standard deviation
            z = (x - self.input_mean[None, :, None, None].to(x.device)) / self.input_std[None, :, None, None].to(x.device)
            up_activation = []
            for i in range(self.n_conv):
                # Remember the input of each encoder stage for the skip connections
                up_activation.append(z)
                z = self._modules['conv%d' % i](z)

            for i in reversed(range(self.n_conv)):
                z = self._modules['upconv%d' % i](z)
                # Crop the up-sampled map to the size of the matching encoder activation
                z = z[:, :, :up_activation[i].size(2), :up_activation[i].size(3)]
                if self.use_skip:
                    z = torch.cat([z, up_activation[i]], dim=1)
            # Only the first classifier channel is used to locate the aim point
            return spatial_argmax(self.classifier(z)[:,0,:,:])

The code above defines an FCN that predicts the aim point of an input image. The input images span x: 0..127 and y: 0..95, but because the network ends in spatial_argmax(), the code as shown returns coordinates normalized to the range [-1, 1] along each axis rather than pixel indices. The model has three components: the Planner class and its nested Block and UpBlock classes.
The Block class is the fundamental building block of the model. It passes the input through three convolution layers, each followed by batch normalization, with rectified linear unit (ReLU) activations in between; the first convolution down-samples by the given stride. A 1×1 strided convolution carries the input around these layers, and the two paths are summed before the final ReLU, which makes the block residual. The forward function executes these transformations in a forward pass to generate the output.
The UpBlock class constructs the up-sampling blocks for the model. It applies a transposed convolution layer followed by a ReLU activation; unlike Block, the code shown uses no batch normalization here. The forward function applies these transformations in a forward pass to generate the output.
The Planner class is the primary class that defines the architecture of the model. It accepts a list of layer sizes and uses them to create the convolution and up-sampling layers. When the use_skip argument is True, it also concatenates each up-sampled feature map with the encoder activation of the same resolution. The forward function first normalizes the input using the input_mean and input_std tensors, then applies the convolution layers followed by the up-sampling layers, cropping each up-sampled map to the size of its matching skip activation. Finally, it applies the spatial_argmax() function to the first channel of the classifier output to obtain the x and y coordinates of the aim point.
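To see how the feature-map sizes and channel counts evolve through the planner, the following sketch traces shapes with the standard output-size formulas for strided and transposed convolutions (dilation 1), using the kernel size, stride, and padding from the code above. The 96×128 input size is an assumption taken from the coordinate ranges quoted in this report, and the helper names are hypothetical.

```python
def conv_out(n, kernel=3, stride=2, padding=1):
    """Output length of a strided convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


def upconv_out(n, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a transposed convolution along one axis."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


def trace_planner_shapes(height, width, layers=(16, 32, 64, 128), in_channels=3):
    """Return (channels, height, width) after each encoder and decoder stage."""
    shapes = [(in_channels, height, width)]
    skips = []
    c, h, w = in_channels, height, width
    for out_c in layers:                       # encoder: halve the resolution
        skips.append((c, h, w))                # input of this stage feeds a skip
        c, h, w = out_c, conv_out(h), conv_out(w)
        shapes.append((c, h, w))
    for out_c in reversed(layers):             # decoder: double the resolution
        skip_c, skip_h, skip_w = skips.pop()
        h = min(upconv_out(h), skip_h)         # crop to the skip activation
        w = min(upconv_out(w), skip_w)
        c = out_c + skip_c                     # channel-wise concatenation
        shapes.append((c, h, w))
    return shapes


for shape in trace_planner_shapes(96, 128):
    print(shape)
```

With an odd input height such as 97, the transposed convolutions overshoot by one pixel at each stage, which is what the cropping step in forward compensates for. The final stage has 19 channels (16 from the last UpBlock plus the 3 normalized image channels), which is the input width of the classifier.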
The function below is used by the planner. It is a utility that converts a map of per-pixel scores into a single (x, y) coordinate by taking the probability-weighted average position, a differentiable "soft" argmax.
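
    import torch
    import torch.nn.functional as F

    def spatial_argmax(logit):
        # Softmax over all pixels of each (H, W) map, so every map sums to 1
        weights = F.softmax(logit.view(logit.size(0), -1), dim=-1).view_as(logit)
        # Expected x (over width) and y (over height), each on a grid from -1 to 1
        x_y_tensor = torch.stack(((weights.sum(1) * torch.linspace(-1, 1, logit.size(2)).to(logit.device)[None]).sum(1),
                                  (weights.sum(2) * torch.linspace(-1, 1, logit.size(1)).to(logit.device)[None]).sum(1)), 1)
        return x_y_tensor
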
It takes a logit tensor of shape (batch, height, width), flattens the spatial dimensions, and applies F.softmax() over all pixels so that each score map becomes a probability distribution. It then computes the expected x and y coordinates under that distribution, using coordinate grids that run from -1 to 1 along the width and height. The result is therefore a weighted average position rather than the location of the single highest-scoring pixel. The function returns the x and y coordinates stacked into a tensor of shape (batch, 2). The Planner class uses it to obtain the predicted aim point from the output of the neural network.
Input
logit: a (batch, height, width) tensor of per-pixel scores, taken from the first channel of the planner's classifier output
Output
x_y_tensor: a (batch, 2) tensor that contains the stacked x and y coordinates of the predicted aim point, each in the range [-1, 1]
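The behavior of this soft argmax can be checked on a tiny score map without PyTorch. The sketch below reimplements the same computation for a single height × width map in plain Python. The names `soft_argmax_2d` and `to_pixels` are hypothetical, and the conversion to pixel indices assumes that -1 and 1 map to the first and last pixel of each axis.

```python
import math


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def soft_argmax_2d(logit):
    """Expected (x, y) position of one H x W score map, each in [-1, 1]."""
    h, w = len(logit), len(logit[0])
    peak = max(max(row) for row in logit)      # subtract the max for stability
    exp = [[math.exp(v - peak) for v in row] for row in logit]
    total = sum(sum(row) for row in exp)
    xs, ys = linspace(-1.0, 1.0, w), linspace(-1.0, 1.0, h)
    x = sum(exp[r][c] / total * xs[c] for r in range(h) for c in range(w))
    y = sum(exp[r][c] / total * ys[r] for r in range(h) for c in range(w))
    return x, y


def to_pixels(x, y, width, height):
    """Map [-1, 1] coordinates to pixel indices (assumed endpoint alignment)."""
    return (x + 1) / 2 * (width - 1), (y + 1) / 2 * (height - 1)


# One sharp peak in the top-right corner of a 3 x 5 map.
one_peak = [[0.0] * 5 for _ in range(3)]
one_peak[0][4] = 50.0
# Two equal peaks at the left and right ends of the middle row.
two_peaks = [[0.0] * 5 for _ in range(3)]
two_peaks[1][0] = two_peaks[1][4] = 50.0

print(soft_argmax_2d(one_peak))    # close to (1, -1): the top-right corner
print(soft_argmax_2d(two_peaks))   # close to (0, 0): midway between the peaks
print(to_pixels(1.0, -1.0, 128, 96))
```

The two-peak case shows the difference from a hard argmax: the result lands between the peaks because it is an expectation over the whole map.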
The code snippet provided below defines a training function that
trains the deep learning model defined in the Planner
class.

    import inspect
    from os import path

    import torch

    # Planner, load_data and dense_transforms come from the project's own modules

    def train(args):
        device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        model = Planner().to(device)
        if args.continue_training:
            # Resume from the checkpoint stored next to this file
            model.load_state_dict(torch.load(path.join(path.dirname(path.abspath(__file__)), 'planner.th'),
                                             map_location=device))

        optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate, weight_decay=1e-5)

        # args.transform is a Python expression evaluated against the classes in dense_transforms
        transform = eval(args.transform, {k: v for k, v in inspect.getmembers(dense_transforms) if inspect.isclass(v)})
        train_data = load_data('drive_data', num_workers=4, transform=transform)

        loss = torch.nn.MSELoss()
        global_step = 0
        for epoch in range(args.num_epoch):
            model.train()
            for img, label in train_data:
                img, label = img.to(device), label.to(device)
                # The planner returns predicted aim points, compared directly with the labels
                pred = model(img)
                loss_val = loss(pred, label)

                optimizer.zero_grad()
                loss_val.backward()
                optimizer.step()
                global_step += 1

This function uses the Adam optimizer with the specified learning rate and a weight decay of 1e-5. Training data is loaded with the load_data function, which reads from the drive_data directory and applies the designated transformation; in this case, a random horizontal flip followed by conversion of the data to a tensor.
For the purposes of training, the Mean Squared Error (MSE) loss function is utilized. During the training loop, which continues for a specified number of epochs, batches of data are fed to the model, and the optimizer updates the model parameters.
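As a small worked example of the loss, the sketch below computes the mean squared error between a batch of predicted and labeled aim points in plain Python, averaging over every coordinate as the default "mean" reduction does. The function name `mse_loss` and the sample values are made up for illustration.

```python
def mse_loss(pred, target):
    """Mean of the squared differences over every coordinate in the batch."""
    squared = [(p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)]
    return sum(squared) / len(squared)


# Two predicted aim points (x, y) and their labels, in normalized coordinates.
pred = [[0.10, -0.20], [0.00, 0.30]]
label = [[0.00, -0.20], [0.20, 0.10]]

loss = mse_loss(pred, label)
print(loss)  # (0.01 + 0.0 + 0.04 + 0.04) / 4
```

Because the error is squared, a prediction that is twice as far from the label contributes four times as much to the loss.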
After training the FCN on the drive dataset created using the designed controller, the model was able to run accurately and complete all courses in the allotted time.
Below is a screenshot of the planner model driving the kart:
In this image there is an additional green circle, which represents the aim point predicted by the FCN. The red circle still indicates the actual aim point, and the blue circle marks the center of the kart.
In this document, we detail the creation of a vision-based driving system for SuperTuxKart, a widely used open-source racing game, using deep learning and computer vision techniques. We begin with an overview of convolutional neural networks (CNNs) and fully convolutional networks (FCNs), which are critical to understanding the development of our vision-based driving system.
Next, we delve into the creation of a low-level controller that enables the autonomous kart to move, steer, brake, and drift based on a pre-determined set of rules. This controller processes an aim point and the current velocity of the car as input, and generates the next action that the kart takes to track the given aim-point at the present velocity. The auto-pilot controller is subsequently employed to create a training dataset for the FCN model.
Lastly, we describe the development of an FCN-based planner that takes an image as input and predicts the aim-point for the low-level controller. The planner is trained on the dataset produced by the auto-pilot controller and accurately forecasts the aim-point for the kart to follow based on the current image.